Prediction of various blood group systems using Korean whole-genome sequencing data

Aims
This study established blood group analysis methods using whole-genome sequencing (WGS) data and applied them to public WGS data from the Korean Reference Genome Project, conducted by the Korea Disease Control and Prevention Agency (KDCA), to determine domestic blood group allele frequencies.

Materials and methods
We analyzed the differences between the human reference sequence (hg19) and the conventional reference cDNA sequences of blood group genes using the Clustal Omega website, and established blood group analysis methods using WGS data for 41 genes: 39 blood group genes involved in 36 blood group antigens, plus GATA1 and KLF1, which are erythrocyte-specific transcription factor genes. Using CLC Genomics Workbench 11.0 (Qiagen, Aarhus, Denmark), variant analysis was performed on these 41 genes in 250 Korean WGS data sets, and the genotype of each blood group was predicted. The frequencies of the major alleles were also investigated and compared with data from the Korean rare blood program (KRBP) and the Erythrogene database (East Asian and all races).

Results
Among the 41 blood group-related genes, hg19 showed variants in the following genes compared to the conventional reference cDNA: GYPA, RHD, RHCE, FUT3, ACKR1, SLC14A1, ART4, CR1, and GCNT2. Among the 250 WGS data sets from the Korean Reference Genome Project, an average of 70.6 variants were analyzed in 205 samples; 45 samples were excluded because they showed no variants. In particular, the FUT3, GCNT2, B3GALNT1, CR1, and ACHE genes contained numerous variants, with averages of 21.1, 13.9, 13.4, 9.6, and 7.0, respectively. Alleles were successfully predicted in most blood groups; the exceptions were blood groups such as ABO and Lewis, for which alleles were difficult to predict from WGS data alone. A comparison of allele frequencies showed no significant differences from the KRBP data, but there were differences from the Erythrogene data for the Lutheran, Kell, Duffy, Yt, Scianna, Landsteiner-Wiener, and Cromer blood group systems. Numerous minor blood group systems that were not available in the KRBP data were also included in this study.

Conclusions
We successfully established and performed blood group analysis using Korean public WGS data. Blood group analysis using WGS data is expected to be performed more frequently in the future, and to contribute to domestic data on blood group allele frequencies and eventually to the supply of safe blood products.


Introduction
Human red blood cells contain many blood group antigens. To date, 43 blood group systems containing 345 red cell antigens have been officially recognized by the International Society of Blood Transfusion (ISBT) [1]. The diversity of blood group antigens is primarily due to single nucleotide polymorphisms (SNPs) in blood group genes. Blood group antigen typing is classically conducted using serological testing by hemagglutination, but recently this process has been automated. DNA-based molecular diagnostics (genotyping) have replaced serological methods. That is, SNP-based molecular testing and Sanger sequencing are used to analyze specific SNPs, alongside DNA microarray methods in clinical laboratories [2][3][4]. Molecular testing is easy to automate, can be multiplexed, and does not require expensive and difficult-to-find antisera, making it possible to test a broader range of blood types for patients and donors, and to help identify donors and select blood products for rare blood types [5]. However, SNP-based molecular diagnosis and Sanger sequencing have limitations, as they cannot include all known blood group genes or detect new blood group antigen alleles. Several commercial multiplexed molecular diagnostic kits are currently available, but they do not cover all known blood group genes, and at most they identify only 35-37 red blood cell antigens from 10-11 blood groups. Erythrocyte genotyping using next-generation sequencing (NGS) has several advantages [6][7][8]. NGS enables the evaluation of whole-genome sequences to detect gene rearrangements and analyze copy numbers. NGS can detect new alleles in addition to known SNPs and establish new weak or silencing alleles. One study performed blood group analysis using NGS data from 2,504 people provided by the 1000 Genomes Project, but Korean data were not included [9].
The Korean rare blood program (KRBP), known as the Korean national recipient registry, was established in July 2013 [10,11]. The definition of a rare blood group depends on the prevalence of blood antigens in a specific population. Accurate data on the frequencies of various blood antigens are essential for a rare blood program, because they can be used to predict the availability of blood products for patients with the corresponding antibodies. We used commercially available multiplex molecular assays to establish the rare donor program and explored the prevalence of various blood group antigens. However, not all known blood group antigens were included. The present study established blood group analysis methods using whole-genome sequencing (WGS) data and applied them to public WGS data from the Korean Reference Genome Project, conducted by the Korea Disease Control and Prevention Agency (KDCA), to determine the domestic allele frequencies. These frequencies were compared with previous KRBP data and with data from other ethnic groups.

Materials and methods
Difference analysis between human reference sequences (hg19) and conventional reference cDNA
Conventional reference alleles and coding DNA sequences (CDS) were investigated for 41 genes (Table 1 and S1 Table), including 39 blood group genes involved in 36 blood group antigens, and the GATA1 and KLF1 genes, which are erythrocyte-specific transcription factor genes [12][13][14]. The conventional reference alleles for 40 genes were available directly from ISBT, and FUT3 alleles were available from the Blood Group Antigen Gene Mutation Database (dbRBC). The human reference genome (hg19) UCSC genomic transcripts (corresponding to the splicing pattern of the conventional cDNA sequence) for these 41 genes were also investigated using the UCSC genome browser [15]. The CDS of the conventional reference alleles and the human reference genome for each gene were aligned, and the Clustal Omega website was used to identify nucleotide changes [16]. We then analyzed the differences between hg19 and conventional reference cDNA and determined the blood group alleles of hg19 (Table 1). The overall workflow is described in S1 Fig.
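The comparison step can be illustrated with a minimal Python sketch. It is not the Clustal Omega procedure itself: it assumes the two coding sequences have already been aligned to equal length (with `-` for gaps), and the function name `cds_differences` and the toy sequences are hypothetical. It only lists single-base substitutions, numbered on the conventional reference CDS.

```python
def cds_differences(reference_cds: str, hg19_cds: str) -> list[str]:
    """List substitutions between two pre-aligned, equal-length CDS strings.

    Positions are 1-based and counted on the reference sequence, so gap
    columns ('-') in the reference do not advance the counter.
    """
    if len(reference_cds) != len(hg19_cds):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    position = 0
    for ref_base, hg19_base in zip(reference_cds.upper(), hg19_cds.upper()):
        if ref_base != "-":
            position += 1
        if ref_base == hg19_base or "-" in (ref_base, hg19_base):
            continue  # identical column, or an indel that this sketch skips
        changes.append(f"{position}{ref_base}>{hg19_base}")
    return changes


# Toy sequences for illustration only, not real blood group gene sequences.
conventional = "ATGGGCTCAGCA"
hg19 = "ATGGACTCAGCG"
print(cds_differences(conventional, hg19))
```

Insertions and deletions would need separate handling, which this sketch deliberately leaves out.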

Establishment of blood group analysis methods using WGS data
After importing the WGS data (BAM files) into CLC Genomics Workbench 11.0 (Qiagen, Aarhus, Denmark) [17], the data were realigned to hg19, and variant analysis was performed on the coding regions of the 41 blood group-related genes. The alleles of each blood group were predicted by analyzing the variants for each gene and comparing them with the hg19 genotype.
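Because variant callers report differences from hg19 rather than from the conventional reference allele, the calls have to be re-expressed against the conventional reference before an allele can be named. The following Python sketch shows one possible way to do this; it is an illustration, not the CLC Genomics Workbench workflow. The function name, the data layout, and the example positions are hypothetical, and the sketch assumes diploid autosomal genes with at most one alternate base per site.

```python
def rebase_to_conventional(hg19_diff, sample_calls):
    """Re-express variant calls made against hg19 relative to the conventional reference.

    hg19_diff:    {position: (conventional_base, hg19_base)}
    sample_calls: {position: (hg19_base, alt_base, alt_copies)}, alt_copies is 1 or 2
    Returns {change: copies}, where change is written against the conventional base.
    """
    result = {}
    for pos in sorted(set(hg19_diff) | set(sample_calls)):
        if pos in hg19_diff:
            conv_base, hg19_base = hg19_diff[pos]
        else:
            # hg19 and the conventional reference agree at this position
            conv_base = hg19_base = sample_calls[pos][0]
        alt_base, alt_copies = None, 0
        if pos in sample_calls:
            _, alt_base, alt_copies = sample_calls[pos]
        # Chromosome copies without a call still carry the hg19 base
        base_counts = {hg19_base: 2 - alt_copies}
        if alt_copies:
            base_counts[alt_base] = base_counts.get(alt_base, 0) + alt_copies
        for base, copies in base_counts.items():
            if base != conv_base and copies > 0:
                result[f"{pos}{conv_base}>{base}"] = copies
    return result


# Hypothetical example: hg19 carries A where the conventional reference has G.
hg19_diff = {125: ("G", "A")}
print(rebase_to_conventional(hg19_diff, {}))                    # no call at 125
print(rebase_to_conventional(hg19_diff, {125: ("A", "G", 1)}))  # heterozygous call
print(rebase_to_conventional(hg19_diff, {125: ("A", "G", 2)}))  # homozygous call
```

The point of the design is that a sample with no call at a position where hg19 itself differs from the conventional reference still carries two copies of the non-conventional change, whereas a homozygous call back to the conventional base carries none.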

Blood group analysis using Korean WGS public data
We received 250 Korean WGS data sets (BAM files) from the Korean Reference Genome Project through the Human Resource Distribution Desk of the National Institute of Health of the KDCA. Variant analysis was performed on the 41 blood group-related genes in the Korean WGS data using the method described above, and the alleles of each blood group were predicted. The frequencies of the major alleles were also investigated and compared with the frequencies in the previous KRBP data and in the Erythrogene database (East Asians and all races), which is based on data from 2,504 people from 26 races in the 1000 Genomes Project.

Statistical analysis
The chi-square test and Fisher's exact test were applied to compare the allele frequencies, and P values <0.05 were considered statistically significant. Statistical analyses were performed using MedCalc software, version 19.8 (MedCalc Software Ltd., Ostend, Belgium).
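As an illustration of the comparison described above, the sketch below computes a Pearson chi-square test and a two-sided Fisher's exact test for a 2×2 table of allele counts using only the Python standard library. It is not MedCalc's implementation: the absence of a continuity correction, the two-sided definition used for Fisher's test, the function names, and the example counts are all assumptions made for illustration.

```python
from math import comb, erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value); with 1 degree of freedom the upper tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return statistic, erfc(sqrt(statistic / 2))


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact P: sum over tables no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    observed = prob(a)
    low, high = max(0, row1 - (n - col1)), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(low, high + 1))
                if p <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical allele counts (NOT values from this study): each pair is
# (copies of the allele of interest, copies of all other alleles).
study = (400, 10)
comparison = (950, 58)
statistic, p_chi = chi_square_2x2(*study, *comparison)
p_fisher = fisher_exact_2x2(*study, *comparison)
print(f"chi-square = {statistic:.2f}, P = {p_chi:.4f}; Fisher P = {p_fisher:.4f}")
```

Note that the counts are alleles rather than individuals, so a cohort of N diploid samples contributes 2N alleles for an autosomal gene.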

Ethics statement
This study used publicly available, anonymized NGS data (BAM files) that did not include other medical records, and it involved no contact with research subjects. The study was approved by the Institutional Ethics Committee of Seoul National University Bundang Hospital with a waiver of consent and review exemption (IRB No. X-1801-447-906 and X-1903-528-901).

Results
Difference between hg19 and conventional reference cDNA
We investigated and analyzed the differences between the human reference sequences (hg19) and the CDS of the conventional reference alleles for 41 blood group-related genes. Table 1 lists the ISBT number, blood system name and symbol, gene name, chromosomal location, conventional reference allele, and phenotype for the 41 blood group-related genes analyzed in this study. Table 1 also lists the differences between hg19 and conventional cDNA for these 41 genes, including nucleotide changes, predicted amino acid changes, predicted allele name, and predicted phenotype. Among these 41 genes, hg19 showed variants in the following genes compared to the conventional reference cDNA: GYPA, RHD, RHCE, FUT3, ACKR1, SLC14A1, ART4, CR1, and GCNT2.

Blood group analysis using Korean WGS public data
Among the 250 WGS data sets of the Korean Reference Genome Project, an average of 70.6 variants were analyzed in 205 samples. We excluded 45 samples because they showed no variants. The FUT3, GCNT2, B3GALNT1, CR1, and ACHE genes showed numerous variants, with averages of 21.1, 13.9, 13.4, 9.6, and 7.0 variants, respectively, which made allele prediction for these genes difficult.
Other alleles were successfully predicted in most blood groups, except for ABO and Lewis, for which it was difficult to predict alleles and phenotypes from the WGS data alone. Table 2 lists the blood group system predictions from the representative data compared to conventional cDNA sequences. Table 3 and Fig 1 show the allele frequencies compared with the previous KRBP data and the Erythrogene (East Asian and all races) database. Allele frequencies for the Lutheran, Kell, Duffy (FY*01), Yt, Landsteiner-Wiener, and Cromer blood group systems were similar to the KRBP data but differed from the Erythrogene data. This study included numerous minor blood group systems that were not available in the KRBP data, and most showed allele frequency differences compared to the Erythrogene data.

Discussion
This study established blood group analysis methods using WGS data and analyzed the blood groups using Korean WGS public data. Blood group gene analysis differs from the genetic analysis used to diagnose tumors or congenital genetic diseases. There is a conventional reference allele for each blood group, and the SNPs and blood group genotypes relative to the reference allele are well documented. There are several blood group gene databases. The Blood Group Antigen Mutation (BGMUT) Database was created by the Human Genome Variation Society (HGVS) in 1999 [14]. Since 2006, it has been operated by the National Institutes of Health (NIH) as part of the database Red Blood Cells (dbRBC) of the National Center for Biotechnology Information (NCBI), which ceased operation in October 2017. In 2016, Moller et al. created the Erythrogene database following analysis of 36 blood groups from the 1000 Genomes Project [9]. There are also the ISBT website [13], the Blood Group Antigen FactsBook [18], and BOOGIE [19]. Since the blood group genotypes of hg19 are not the same as the conventional reference alleles of each blood group, we first analyzed the blood group genotypes of the human reference genome and noted the differences compared to the conventional reference alleles. Most variant analysis software is designed to find variants by comparing nucleotide sequences to the human reference genome, so variants in blood group-related genes are reported relative to hg19, in the same way as in other genetic analyses, rather than relative to the conventional reference alleles. Therefore, the results of variant detection alone cannot determine the blood group types. In this study, the differences between hg19 and conventional cDNA were analyzed first (Table 1), and we used these results to conduct blood group analyses in the WGS data. WGS data analysis, like other NGS data analyses, involves read mapping and variant detection after aligning the sequences to the human reference sequence.
The alleles and the phenotype of each blood group are then analyzed using the detected variants. Alleles were successfully predicted in most blood groups except for ABO (ABO) and Lewis (FUT3), which were difficult to predict using WGS data alone. ABO and Lewis antigens are carbohydrate antigens synthesized by enzymes [20]. The A and B genes of the ABO blood group are alleles located at the same position; blood type A refers to the A phenotype, but the underlying genotype may be AA or AO. Over 200 genotypes have been reported for ABO, which include nucleotide changes in various regions. However, the VCF file obtained following variant detection cannot distinguish the two alleles. Moreover, because carbohydrate antigen prediction requires the evaluation of several related genes, and the alleles associated with carbohydrate antigens are complex [5], determination of alleles based only on WGS data is challenging, and phenotyping results are needed for support. It was also difficult to determine the blood type using only the WGS data for the Lewis blood group, because factors other than the FUT2 and FUT3 genes are involved in the blood group phenotype.
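One way to read the difficulty with unphased calls: a genotype list records how many copies of each variant a sample carries, but not which variants lie on the same chromosome. The Python sketch below enumerates the haplotype pairs that are consistent with a set of unphased genotypes. The function name and the variant labels are hypothetical placeholders, not real ABO variants, and the sketch assumes biallelic sites in a diploid gene.

```python
from itertools import product


def possible_diplotypes(genotypes):
    """Enumerate haplotype pairs consistent with unphased genotypes.

    genotypes: list of (variant_label, alt_copies) with alt_copies in {0, 1, 2}.
    Returns a set of pairs; each haplotype is a sorted tuple of variant labels.
    """
    heterozygous = [label for label, copies in genotypes if copies == 1]
    homozygous = [label for label, copies in genotypes if copies == 2]
    diplotypes = set()
    # Each heterozygous variant may sit on either of the two chromosomes.
    for assignment in product((0, 1), repeat=len(heterozygous)):
        haplotypes = (set(homozygous), set(homozygous))
        for label, side in zip(heterozygous, assignment):
            haplotypes[side].add(label)
        pair = tuple(sorted(tuple(sorted(h)) for h in haplotypes))
        diplotypes.add(pair)  # sorting removes mirror-image duplicates
    return diplotypes


# Hypothetical sample with two heterozygous variants and one homozygous variant.
sample = [("variant1", 1), ("variant2", 1), ("variant3", 2)]
for pair in sorted(possible_diplotypes(sample)):
    print(pair)
```

With two heterozygous sites there are two candidate allele pairs, and the number doubles with each additional heterozygous site, which illustrates why additional phasing or phenotype information is needed to name both alleles.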
Additionally, no variants of the RH (RHD and RHCE) or MNS (GYPA and GYPB) blood group systems were detected in the Korean public WGS data. However, we visually confirmed that the mapping and variant detection of these genes were performed successfully. Although the data were not included in this study, we also conducted blood group analysis using WGS data from clinical samples, in which more variants were detected. The public data were obtained from the Human Resource Distribution Desk of the National Institute of Health of the KDCA. These data sets were collected through the Korean Reference Genome Project between 2012 and 2014 using HiSeq2000 (Illumina, San Diego, CA, USA) with a maximum of 30× depth of coverage per sample [21]. The BAM files were already aligned to hg19, and we used hg19 as the reference genome. We judged that too few variants were detected because of the lower coverage and poorer data quality compared to the clinical patient samples. High numbers of variants were detected in the ACHE (Yt), CR1 (Knops), GCNT2 (I), and B3GALNT1 (Globoside) genes. However, it is not known whether most of these nucleotide changes encode antigenic epitopes [13], and we were able to predict the alleles in most cases (Table 2). We also investigated the frequencies of the major alleles and compared them to the frequencies in the KRBP data and in the Erythrogene (East Asian and all races) data from the 1000 Genomes Project (Table 3). The 1000 Genomes Project contains data from 2,504 people from 26 races, divided into African (661), American (347), East Asian (504), European (503), and South Asian (489); however, the East Asian group includes only Chinese, Japanese, and Vietnamese people, not Koreans [9].
Although it is well known that allele frequencies vary among ethnic groups, few studies exist on the allele frequencies in different blood group systems [10,22]. Allele frequencies of the Lutheran, Kell, Duffy, Diego, Yt, Scianna, Dombrock, Landsteiner-Wiener, and Cromer blood group systems were available in the KRBP data, and the Lutheran, Kell, Duffy (FY*01), Yt, Landsteiner-Wiener, and Cromer systems showed no significant differences in allele frequencies between this study and the KRBP study. However, the allele frequencies of the Lutheran, Kell, Duffy (FY*01), and Yt blood group systems differed significantly from the Erythrogene (East Asian and all races) data, and the allele frequency of the Cromer blood group system differed significantly from the Erythrogene (all races) data. The high-frequency alleles LU*02, KEL*02, FY*01, YT*01, SC*01, DO*02, and CROM*01 were more frequent in Koreans than in East Asians or all races in the Erythrogene database. In the Duffy system, the allele frequency of FY*02 differed significantly from the KRBP data (P = 0.023), but this was because the allele with the 125A>G and 199C>T nucleotide changes was detected as the FY*02 allele in the KRBP study. In the Diego system, the Di^a antigen is extremely rare in people of European or African descent, but its prevalence is about 5% in people of Chinese or Japanese ancestry and is even higher in the indigenous peoples of North and South America, reaching 54% [20]. The antigen prevalence of Di^a is 6.4-14.5% in Koreans [23]. The allele frequency of DI*01 was 6.92% in the KRBP data, 0.30% in the East Asian Erythrogene data, and 0.18% in all the Erythrogene data. However, no DI*01 variants were detected in the Korean public WGS data. This could be due to the smaller sample size or the poor quality of the WGS data.
In the Dombrock system, the frequency of the DO*01 allele was lower in this study than in the KRBP study. Because hg19 itself is DO*02, the DO*01 variant may not have been detected due to the low data quality. Numerous minor blood group systems that were not included in the KRBP data are included in our study. Antibodies to many of these antigens are rarely encountered because they are high-prevalence antigens in most populations; however, the Colton, Gerbich, RHAG, JR, LAN, and Vel systems can cause acute hemolytic transfusion reactions or hemolytic disease of the newborn [24,25]. Accurate data on the frequencies of various blood group antigens are essential to predict the availability of compatible blood components for patients with the corresponding antibodies and are indispensable for a rare blood program. Accurate information on the frequencies of specific antigen-negative blood units will help reduce unnecessary antigen testing and avoid delays in issuing blood units to patients. Furthermore, it will contribute to improving blood transfusion safety and blood supply management.

Conclusion
We successfully established blood group analysis methods using WGS data and performed blood group analyses on Korean public WGS data. This study was limited by the number and quality of the WGS data sets. In addition, confirmatory tests such as serologic tests or further molecular assays could not be performed. Nevertheless, even with WGS or whole-exome sequencing (WES) data, which are not intended for blood group genotyping, we were able to analyze the various blood group alleles using the method established in this study. Accumulating frequency data for diverse blood group systems will support the supply of safe blood products and the provision of adequate blood for patients with the relevant antibodies.

Supporting information
S1 Table. The source of the conventional reference alleles for the 41 blood group genes. (DOCX)
S1 Fig. Graphic workflow of this study. Part 1: Conventional reference alleles and coding DNA sequences (CDS) were investigated for 41 genes, including 39 blood group genes involved in 36 blood group antigens, and the GATA1 and KLF1 genes, which are erythrocyte-specific transcription factor genes [12]. The human reference genome (hg19) UCSC genomic transcripts (corresponding to the splicing pattern of the conventional CDS) for these 41 genes were also investigated using the UCSC genome browser [15]. The CDS of the conventional reference alleles and the human reference genome for each gene were aligned, and the Clustal Omega website was used to identify nucleotide changes [16]. We then analyzed the differences between hg19 and conventional reference cDNA and determined the blood group alleles of hg19. Part 2: After importing the 250 Korean WGS data sets (BAM files) into CLC Genomics Workbench 11.0 (Qiagen, Aarhus, Denmark) [17], the data were realigned to hg19, and variant analysis was performed on the coding regions of the 41 blood group-related genes. The alleles of each blood group were predicted by analyzing the variants for each gene and comparing them with the hg19 genotype. Abbreviations: CDS, coding DNA sequences; WGS, whole-genome sequencing.